NSF PAR Search | NSF Public Access Repository

RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors

https://doi.org/10.18653/v1/2024.acl-long.674

Dugan, Liam; Hwang, Alyssa; Trhlík, Filip; Zhu, Andrew; Ludan, Josh Magnus; Xu, Hainiu; Ippolito, Daphne; Callison-Burch, Chris (January 2024, Association for Computational Linguistics)

Many commercial and open-source models claim to detect machine-generated text with extremely high accuracy (99% or more). However, very few of these detectors are evaluated on shared benchmark datasets and even when they are, the datasets used for evaluation are insufficiently challenging—lacking variations in sampling strategy, adversarial attacks, and open-source generative models. In this work we present RAID: the largest and most challenging benchmark dataset for machine-generated text detection. RAID includes over 6 million generations spanning 11 models, 8 domains, 11 adversarial attacks and 4 decoding strategies. Using RAID, we evaluate the out-of-domain and adversarial robustness of 8 open- and 4 closed-source detectors and find that current detectors are easily fooled by adversarial attacks, variations in sampling strategies, repetition penalties, and unseen generative models. We release our data along with a leaderboard to encourage future research.
Full Text Available
Exploring the Curious Case of Code Prompts

https://doi.org/10.18653/v1/2023.nlrse-1.2

Zhang, Li; Dugan, Liam; Xu, Hainiu; Callison-Burch, Chris (June 2023, Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE))

Recent work has shown that prompting language models with code-like representations of natural language leads to performance improvements on structured reasoning tasks. However, such tasks comprise only a small subset of all natural language tasks. In our work, we seek to answer whether or not code-prompting is the preferred way of interacting with language models in general. We compare code and text prompts across three popular GPT models (davinci, code-davinci-002, and text-davinci-002) on a broader selection of tasks (e.g., QA, sentiment, summarization) and find that with few exceptions, code prompts do not consistently outperform text prompts. Furthermore, we show that the style of code prompt has a large effect on performance for some (but not all) tasks and that fine-tuning on text instructions leads to better relative performance of code prompts.
Full Text Available
Causal Reasoning of Entities and Events in Procedural Texts

https://doi.org/10.18653/v1/2023.findings-eacl.31

Zhang, Li; Xu, Hainiu; Yang, Yue; Zhou, Shuyan; You, Weiqiu; Arora, Manni; Callison-Burch, Chris (May 2023, Findings of the Association for Computational Linguistics: EACL 2023)

Entities and events are crucial to natural language reasoning and common in procedural texts. Existing work has focused either exclusively on entity state tracking (e.g., whether a pan is hot) or on event reasoning (e.g., whether one would burn themselves by touching the pan), while these two tasks are often causally related. We propose CREPE, the first benchmark on causal reasoning of event plausibility and entity states. We show that most language models, including GPT-3, perform close to chance at .35 F1, lagging far behind human performance at .87 F1. We boost model performance to .59 F1 by creatively representing events as programming languages while prompting language models pretrained on code. By injecting the causal relations between entities and events as intermediate reasoning steps in our representation, we further boost the performance to .67 F1. Our findings indicate not only the challenge that CREPE brings for language models, but also the efficacy of code-like prompting combined with chain-of-thought prompting for multihop event reasoning.
Full Text Available
Human-in-the-loop Schema Induction

https://doi.org/10.18653/v1/2023.acl-demo.1

Zhang, Tianyi; Tham, Isaac; Hou, Zhaoyi; Ren, Jiaxuan; Zhou, Leon; Xu, Hainiu; Zhang, Li; Martin, Lara; Dror, Rotem; Li, Sha; et al. (January 2023, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics)

Schema induction builds a graph representation explaining how events unfold in a scenario. Existing approaches have been based on information retrieval (IR) and information extraction (IE), often with limited human curation. We demonstrate a human-in-the-loop schema induction system powered by GPT-3. We first describe the different modules of our system, including prompting to generate schematic elements, manual editing of those elements, and conversion of those into a schema graph. By qualitatively comparing our system to previous ones, we show that our system not only transfers to new domains more easily than previous approaches, but also reduces the effort of human curation thanks to our interactive interface.
Full Text Available
